Univariate Exploration
Bivariate Exploration
Multivariate Exploration
This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.
Note that this dataset will require some data wrangling in order to make it tidy for analysis.
There are 174,951 rows, each representing a ride taken during February 2019 in the greater San Francisco Bay Area (San Francisco, Oakland, San Jose), and 15 columns.
GoBike completed close to 200 thousand rides in the Bay Area in a single month (February 2019), which points to strong demand for short- and medium-distance bike transportation. So first, it is interesting to draw insights from shared mobility over time. Although we only have one month of data, it is still worth looking at trip frequency by day of the week, hour of the day, and so on.
Second, we can look at customer segmentation. We have user features such as age, gender, whether the rider is a subscriber, and whether the ride is part of the 'Bike Share for All' program.
Finally, we can look at the popularity of stations, which may help business decisions such as where to add a station or where to stock more bikes.
The columns include trip duration, start/end time and date, station latitude and longitude, user type, gender, year of birth, and a Bike Share for All flag.
# import all packages and set plots to be embedded inline
import pprint
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_context("paper")
%matplotlib inline
import folium
from folium.plugins import MarkerCluster
import geopandas as gpd
from datetime import datetime
import missingno as msno
from PIL import Image, ImageDraw, ImageFont
import matplotlib.colors
from matplotlib.colors import LinearSegmentedColormap, rgb_to_hsv, hsv_to_rgb
import scipy.ndimage.filters
gb = pd.read_csv('201902-fordgobike-tripdata.csv')
Data-type and quality notes: `start_time` and `end_time` are stored as objects and should be converted to datetime; `start_station_id`, `end_station_id`, and `bike_id` should be integers; `start_station_name`, `start_station_id`, `end_station_name`, `end_station_id`, and `member_birth_year` have missing values to handle.
# check data shape and data type
# look at the missing value
print(gb.shape)
print(gb.info())
print(gb.isnull().sum())
msno.matrix(gb)
# check numeric variables
print(gb.describe())
# check categorical variables
print(gb.user_type.value_counts())
print(gb.member_gender.value_counts())
print(gb.bike_share_for_all_trip.value_counts())
# Data Cleaning 1. Correct data type
gb['start_time'] = pd.to_datetime(gb['start_time'])
gb['end_time'] = pd.to_datetime(gb['end_time'])
gb['start_station_id'] = gb['start_station_id'].astype('Int64')
gb['end_station_id'] = gb['end_station_id'].astype('Int64')
gb['bike_id'] = gb['bike_id'].astype('Int64')
gb['member_birth_year'] = gb['member_birth_year'].astype('Int64')
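Note the capital-I `'Int64'`: it is pandas' nullable integer dtype, which, unlike NumPy's `int64`, can hold missing values, which these columns still contain at this point. A quick toy illustration (not the project data):

```python
import numpy as np
import pandas as pd

# a float column with a gap becomes a nullable integer column; NaN is preserved as <NA>
s = pd.Series([1996.0, np.nan, 1984.0]).astype('Int64')
print(s.dtype, s.isna().sum())  # Int64 1
```

Casting the same series to plain `int64` would raise an error because of the missing value.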
# Data Cleaning 2. missing values
gb = gb.dropna(subset = ['start_station_id'])
#gb.dropna(subset = ['start_station_id']).shape
# Data Cleaning 3. Erroneous value on `member_birth_year`
# gb = gb[gb.member_birth_year >= 1900]  # would also delete all rows with a missing birth year
gb = gb.drop(gb[gb.member_birth_year < 1900].index)
gb.info()
gb.head(5)
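A side note on the cleaning filter above: comparisons with missing values evaluate to False, so a mask like `member_birth_year >= 1900` would silently drop the NaN rows too, which is why `drop` on the `< 1900` index is used instead. A toy check (hypothetical birth years, not the project data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1985.0, 1899.0, np.nan])  # hypothetical birth years
kept = s[s >= 1900]                  # the mask also drops the NaN row
dropped = s.drop(s[s < 1900].index)  # drops only the < 1900 row, keeps NaN
print(len(kept), len(dropped))  # 1 2
```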
gb['hour'] = gb['start_time'].dt.hour
conditions = [(gb.start_station_latitude < 37.837769) &
(gb.start_station_latitude > 37.633183) &
(gb.start_station_longitude < -122.336283) &
(gb.start_station_longitude > -122.603469),
(gb.start_station_latitude < 37.457020) &
(gb.start_station_latitude > 37.221726) &
(gb.start_station_longitude < -121.718031) &
(gb.start_station_longitude > -122.181992),
(gb.start_station_latitude < 37.890247) &
(gb.start_station_latitude > 37.738964) &
(gb.start_station_longitude < -122.200436) &
(gb.start_station_longitude > -122.356161)]
values = ['san_fran', 'san_jose', 'oakland']
gb['city'] = np.select(conditions, values, default='other')  # rows outside all three bounding boxes get 'other' instead of 0
gb['duration_min'] = round(gb['duration_sec']/60, 2)
gb['DoW'] = gb['start_time'].dt.dayofweek
gb['age'] = 2019 - gb['member_birth_year']
gb['age_group'] = pd\
.cut(gb.age, 20, right = True, precision = 0, include_lowest = True)
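As a toy sketch of what `pd.cut` does here (hypothetical ages, not the project data): it slices a numeric column into equal-width interval bins and labels each row with its interval.

```python
import pandas as pd

# hypothetical ages, binned into 4 equal-width intervals (the notebook uses 20)
ages = pd.Series([18, 25, 34, 47, 62, 80])
groups = pd.cut(ages, 4, right=True, precision=0, include_lowest=True)
print(groups.value_counts().sort_index())
```

`include_lowest=True` makes sure the minimum value falls inside the first interval rather than being dropped.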
import math

def distance(lat1, lon1, lat2, lon2):
    """
    Calculate the Haversine distance in km between two points,
    given as latitudes and longitudes in degrees.

    Parameters
    ----------
    lat1, lon1 : float
        Origin coordinates.
    lat2, lon2 : float
        Destination coordinates.

    Returns
    -------
    distance_in_km : float

    Examples
    --------
    >>> round(distance(48.1372, 11.5756, 52.5186, 13.4083), 1)  # Munich -> Berlin
    504.2
    """
    radius = 6371  # mean Earth radius, km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) * math.sin(dlon / 2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return round(radius * c, 2)
gb['distance_km'] = gb\
.apply(lambda x: distance(x['start_station_latitude'],
x['start_station_longitude'],
x['end_station_latitude'],
x['end_station_longitude']), axis = 1)
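Row-wise `apply` with the scalar function above is slow on ~175k rows. As an optional alternative (a sketch using the same Haversine formula, written against NumPy arrays), the whole column can be computed at once:

```python
import numpy as np

def haversine_vec(lat1, lon1, lat2, lon2):
    """Vectorized Haversine distance in km; arguments are degrees (scalars or arrays)."""
    radius = 6371  # mean Earth radius, km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(a))

# same Munich -> Berlin check as the scalar docstring, ~504.2 km
print(round(float(haversine_vec(48.1372, 11.5756, 52.5186, 13.4083)), 1))
```

With this version, the column assignment becomes `gb['distance_km'] = haversine_vec(gb['start_station_latitude'], gb['start_station_longitude'], gb['end_station_latitude'], gb['end_station_longitude'])`, with no `apply`.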
Looking at net departures and net arrivals by station, by day and hour.
Pre-processing: generate a dataframe with station locations (name, latitude, longitude), and count ride arrivals and departures by hour.
Plot: make some basic maps and use circle markers to visualize net arrivals and net departures.
def arr_dep_count_by_time(select_day, select_hour):
    # one row per station with its name and coordinates
    temp1 = gb\
        .groupby('start_station_id')\
        .first()\
        .loc[:, ['start_station_name', 'start_station_latitude', 'start_station_longitude']]
    temp2_start_count = gb[(gb.hour == select_hour) & gb.DoW.isin(select_day)]\
        .groupby('start_station_id')\
        .size()\
        .to_frame('departure_count')
    temp3_end_count = gb[(gb.hour == select_hour) & gb.DoW.isin(select_day)]\
        .groupby('end_station_id')\
        .size()\
        .to_frame('arrival_count')
    trip_count = temp1.join(temp2_start_count).join(temp3_end_count)
    # stations with no departures/arrivals in the window get a count of 0
    trip_count['departure_count'] = trip_count['departure_count'].fillna(0)
    trip_count['arrival_count'] = trip_count['arrival_count'].fillna(0)
    return trip_count
def plot_arr_dep(trip_count, zoom_start = 9.3):
    # draw the map canvas
    map_coords = [37.558639, -122.133339]
    folium_map = folium.Map(location = map_coords,
                            zoom_start = zoom_start,
                            tiles = 'cartodbpositron')
    for index, row in trip_count.iterrows():
        # calculate net departures
        net_departures = row['departure_count'] - row['arrival_count']
        popup_text = "Station: {} <br> Total Departures: {} <br> Total Arrivals: {} <br> Net Departures: {}"\
            .format(row['start_station_name'], row['departure_count'], row['arrival_count'], net_departures)
        # circle radius scales with the absolute net flow
        radius = np.abs(net_departures) / 5
        # yellow = net departures, blue = net arrivals
        color = '#ffcc66' if net_departures > 0 else '#99ccff'
        marker = folium.CircleMarker(location = [row['start_station_latitude'], row['start_station_longitude']],
                                     radius = radius,
                                     color = color,
                                     popup = popup_text,
                                     fill = True)
        marker.add_to(folium_map)
    return folium_map
arr_dep_1_8 = plot_arr_dep(arr_dep_count_by_time([1], 8))
arr_dep_1_17 = plot_arr_dep(arr_dep_count_by_time([1], 17))
from IPython.display import display, HTML
htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
               '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
               .format(arr_dep_1_8.get_root().render().replace('"', '&quot;'), 500, 500,
                       arr_dep_1_17.get_root().render().replace('"', '&quot;'), 500, 500))
display(htmlmap)
Yellow circles mark net departures and blue circles mark net arrivals. Locations with positive net departures during the morning rush hour, 8-9 am (left figure), show net arrivals during the evening rush hour, 5-6 pm (right figure), matching the pattern of people riding to work in the morning and back home in the evening. We can also see the three main areas in the greater SF region: San Francisco, Oakland, and San Jose, which is why we added a `city` column to label each ride's area.
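The counting behind these maps can be sketched on a tiny hypothetical frame: departures and arrivals are counted separately, joined by station, and stations missing on one side get a zero count before the net is taken.

```python
import pandas as pd

# hypothetical per-station counts for one hour
dep = pd.Series({"A": 12, "B": 3}, name="departure_count")
arr = pd.Series({"B": 3, "C": 9}, name="arrival_count")
counts = pd.concat([dep, arr], axis=1).fillna(0)
counts["net_departures"] = counts["departure_count"] - counts["arrival_count"]
print(counts)  # A: +12, B: 0, C: -9
```

A positive net (station A) would get a yellow marker, a negative net (station C) a blue one.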
Pre-processing: first I built a locations dataset containing only unique start-station information (id, name, latitude, longitude), defined a path_id as the pair (start_station_id, end_station_id) in the original dataset, and joined the locations dataset twice (on start_station_id and on end_station_id) to get a paths dataset containing the number of rides for each path id.
def path_count_by_time(select_day, select_hour):
    # coordinates and name for each station
    locations = gb\
        .groupby('start_station_id')\
        .agg({'start_station_latitude': 'mean',
              'start_station_longitude': 'mean',
              'start_station_name': 'first'})
    # (start_station_id, end_station_id) identifies a path
    gb['path_id'] = [(id1, id2) for id1, id2 in zip(gb['start_station_id'], gb['end_station_id'])]
    # select by day and hour, then count rides per path_id
    paths = gb[(gb.hour == select_hour) & (gb.DoW.isin(select_day))]\
        .groupby('path_id')\
        .size()\
        .to_frame('trip_counts')
    # keep only paths with at least one trip and distinct start/end stations
    paths = paths[paths['trip_counts'] > 0]
    paths['start_station_id'] = paths.index.map(lambda x: x[0])
    paths['end_station_id'] = paths.index.map(lambda x: x[1])
    paths = paths[paths['start_station_id'] != paths['end_station_id']]
    # join latitude and longitude for both ends of each path
    paths = paths.join(locations, on='start_station_id')
    locations.columns = ['end_station_latitude', 'end_station_longitude', 'end_station_name']
    paths = paths.join(locations, on='end_station_id')
    paths.index = range(len(paths))
    return paths
def plot_path_count(paths, zoom_start = 9.3):
    folium_map = folium.Map(location = [37.558639, -122.133339],
                            zoom_start = zoom_start,
                            tiles = 'cartodbpositron')
    for i, row in paths.iterrows():
        # line width and opacity scale with the (log) trip count
        line = folium.PolyLine(locations = [(row['start_station_latitude'], row['start_station_longitude']),
                                            (row['end_station_latitude'], row['end_station_longitude'])],
                               opacity = np.log(row['trip_counts']) / 2,
                               weight = np.log(row['trip_counts']) / 2,
                               color = '#0066cc')
        line.add_to(folium_map)
        popup_text = "Station: {} <br> Total Rides departed from here: {}"\
            .format(row['start_station_name'], row['trip_counts'])
        marker = folium.CircleMarker(location = [row['start_station_latitude'], row['start_station_longitude']],
                                     radius = 1,
                                     color = '#3366cc',
                                     popup = popup_text)
        marker.add_to(folium_map)
    return folium_map
path_1_8 = plot_path_count(path_count_by_time([1], 8))
path_6_17 = plot_path_count(path_count_by_time([6], 17))
from IPython.display import display, HTML
htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
               '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
               .format(path_1_8.get_root().render().replace('"', '&quot;'), 500, 500,
                       path_6_17.get_root().render().replace('"', '&quot;'), 500, 500))
display(htmlmap)
The two graphs above plot paths on the map: the left shows paths at 8 am on Tuesday and the right shows paths at 5 pm on Sunday. There is a clear difference. At 8 am on Tuesday, a high volume of paths flows into downtown San Francisco (the Financial District and Civic Center), downtown Oakland, and downtown San Jose. This is consistent with the station-count geo plot before, indicating these rides are morning commutes. The right plot, capturing rides on Sunday evening, shows no high-volume paths.
# user segmentation:
# user_type, member_gender, bike_share_for_all_trip,
sb.set_style('white')
fig, axes = plt.subplots(1, 3, figsize = (16, 5), sharey=True)
c = ['user_type', 'member_gender', 'bike_share_for_all_trip']
for i, user_seg in enumerate(c):
    sb.countplot(data = gb,
                 x = user_seg,
                 ax = axes[i],
                 order = gb[user_seg].value_counts().index,
                 alpha = 0.8, color = 'steelblue')
    # annotate each bar with its ride count
    for d, p in enumerate(axes[i].patches):
        height = p.get_height()
        axes[i].text(p.get_x() + p.get_width()/2., height + 0.1,
                     gb[user_seg].value_counts().iloc[d], ha = "center")
    axes[i].set_title('Number of {}'.format(user_seg))
    axes[i].set_xlabel(user_seg)
sb.despine()
It is tricky to analyze user segmentation, since we do not have a user id. All the user information is attached to rides, which introduces bias because the same user can take multiple rides. Therefore all the insights about user segmentation in this figure are based on the total number of rides. Of 183,214 rides, 163,414 were made by Subscribers and 19,800 by Customers, so almost 90% of rides, roughly 9 out of 10, are part of a subscription package. This does not tell us the actual ratio of subscribers to casual riders, however, since that ratio would be $\frac{\text{# of subscriber user ids}}{\text{# of casual user ids}}$.

As for member gender, although 8,263 rides have no gender information, 130,500 rides were taken by males and 40,804 by females, roughly one ride in four. The rider population is heavily male-skewed; the data alone cannot tell us why, so explanations (commuting demographics, comfort or appearance concerns while riding) remain speculative. As for the Bike Share for All program for low-income residents, 17,346 of 183,214 rides, around 9%, were part of the program.
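The 90% figure above comes from a share computation like the following (toy data with a hypothetical `user_type` column; the notebook applies the same idea to `gb.user_type`):

```python
import pandas as pd

rides = pd.Series(["Subscriber"] * 9 + ["Customer"])  # hypothetical rides
shares = rides.value_counts(normalize=True)
print(shares)  # Subscriber 0.9, Customer 0.1
```

`normalize=True` converts raw counts into proportions of the total.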
temp2 = gb.groupby(['age_group', 'member_gender']).size().to_frame('size').reset_index()
temp2 = temp2.pivot(index = 'age_group', columns = 'member_gender', values = 'size')
#define x and y limits
y = range(0, len(temp2))
x_male = temp2['Male']
x_female = temp2['Female']
#define plot parameters
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(8, 6))
#specify background color and plot title
plt.figtext(.5,.9,"Population Pyramid ", fontsize=15, ha='center')
#define male and female bars
axes[0].barh(y, x_male, align='center', color='steelblue')
axes[0].set(title='Males')
axes[1].barh(y, x_female, align='center', color='lightpink')
axes[1].set(title='Females')
#adjust grid parameters and specify labels for y-axis
#axes[1].grid()
axes[0].set(yticks=y, yticklabels=temp2.index)
axes[0].invert_xaxis()
#axes[0].grid()
#display plot
plt.subplots_adjust(bottom=0.15, wspace=0.03)
sb.despine(left = True)
plt.show()
Now we create a population pyramid showing the age and gender distribution of the rides. It helps us understand the composition of the rider population and spot growth trends. On gender, we can already tell male ridership is much larger than female ridership, which points to growth potential among women; it may be worth running a marketing campaign to encourage female riders. Not surprisingly, the main users are roughly 17 to 40 years old, suggesting bike sharing is more popular among younger riders. Surprisingly, though, there are some senior riders over 80 years old. If we treat senior riders as outliers, we can clean the data later.
def bar_plot(feature):
    temp1 = gb[feature].value_counts().sort_index()
    x = np.arange(len(temp1.index))
    g = sb.barplot(x = temp1.index,
                   y = temp1.values,
                   alpha = 0.8,
                   color = 'steelblue')
    # annotate each bar with its count
    for d, p in enumerate(g.patches):
        height = p.get_height()
        g.text(p.get_x() + p.get_width()/2., height + 0.1, temp1.iloc[d], ha='center', size = 7)
    g.set_xticks(x)
    g.set_xticklabels(temp1.index)
    g.set_title('Number of Rides by {}'.format(feature))
    g.set_xlabel('{}'.format(feature))
    sb.despine()
plt.figure(figsize = (12, 5))
plt.subplot(1,2,1)
bar_plot('DoW')
plt.xticks(np.arange(7), ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
plt.subplot(1,2,2)
bar_plot('hour')
We look at trip counts along two dimensions: day of week and hour of day. From the rides-per-day plot we can tell that the number of trips on weekdays is around double the number on weekends, suggesting riders mostly use the bikes for short commutes or the last mile to their offices or destinations on weekdays. For hour of day, there are two peaks, around 8:00 and 17:00, indicating heavy demand during the morning and evening rush hours.
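A note on the `DoW` encoding assumed by the weekday tick labels: pandas `dt.dayofweek` maps Monday to 0 and Sunday to 6, so labeling positions 0-6 as Monday through Sunday is correct.

```python
import pandas as pd

# 2019-02-04 was a Monday and 2019-02-10 a Sunday
ts = pd.to_datetime(["2019-02-04", "2019-02-10"])
print(list(ts.dayofweek))  # [0, 6]
```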
plt.figure(figsize = [20, 5])
plt.subplot(1,2,1)
sb.histplot(gb['distance_km'])
plt.title('Dinstance (km)')
sb.despine()
plt.subplot(1,2,2)
# natural log transformation (the 0.01 offset avoids log(0))
sb.histplot(np.log(gb.distance_km + 0.01))
There is no feature representing trip distance, so I calculated it from the start and end station coordinates. The distribution is heavily right-skewed, and an issue arises when scaling it with a log transformation: there are 3,822 rides with zero distance. This does not mean the bike never moved; it means it was returned to the same station it left from (e.g., grabbing a bike outside the office to pick up a coffee during a break and docking it back at the same station). Zero distance therefore makes sense, but log(0) is negative infinity, so I apply a $\log(x + 0.01)$ (natural log) transformation to normalize the distance data. The right plot shows a spike around -4.605, which is where the zero-distance rides land.
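The -4.605 spike is just the offset itself: zero-distance rides map to $\ln(0 + 0.01)$. A quick check (natural log, matching the `np.log` call above):

```python
import numpy as np

offset = 0.01
zero_distance_value = float(np.log(0 + offset))
print(round(zero_distance_value, 3))  # -4.605
```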
plt.figure(figsize = [20, 5])
plt.subplot(1,2,1)
sb.histplot(gb.duration_sec/60)
plt.title('Trip Duration in min')
sb.despine()
plt.subplot(1,2,2)
# log-scaled x axis
ticks = [1, 2, 5, 15, 50, 150, 500, 1500]
sb.histplot(gb.duration_sec/60, alpha = 0.8, color = 'steelblue');
plt.xscale('log')
plt.xticks(ticks, ticks)
plt.title('Trip Duration in min (log scale)')
sb.despine()
Then we plot a histogram of trip duration in minutes. The left plot shows the data is heavily right-skewed, so I plot it on a log scale on the right. The right plot looks much closer to a normal distribution, with a slight right tail. We can tell that most rides lasted between 1 and 50 minutes; surprisingly, there are some bike rides longer than an hour.
Investigating the distributions of individual variables, I looked at user segmentation, the population pyramid, trip counts over time, trip duration, and trip distance. I found a few unusual points during univariate exploration.
The oldest user is over 100 years old, the longest trip duration is around 24 hours, and the shortest trip distance is zero. There are reasonable explanations for all of these points, and none are impossible, so I kept those rides and deleted only the single record where the user's age exceeded 120.
For trip duration and trip distance, I applied transformations to normalize the data; after scaling, both distributions look much closer to normal.
To create the plots above, I did some feature engineering to create new features:
`duration_min` (dividing `duration_sec` by 60), `age` (derived from `member_birth_year`), `age_group` (cut points on `age`), `DoW` (day of week extracted from `start_time`), `hour` (hour extracted from `start_time`), and `distance_km` (the distance between the start and end station coordinates).
I deleted only one record, due to age (a 141-year-old rider looks like an inaccurate input).
temp3_cus = gb[gb.user_type == 'Customer'].groupby(['DoW', 'hour']).size().to_frame('size').reset_index()
temp3_cus = temp3_cus.pivot(index = 'hour', columns = 'DoW', values = 'size').fillna(0)
temp3_sub = gb[gb.user_type == 'Subscriber'].groupby(['DoW', 'hour']).size().to_frame('size').reset_index()
temp3_sub = temp3_sub.pivot(index = 'hour', columns = 'DoW', values = 'size').fillna(0)
plt.figure(figsize = (15, 10))
plt.suptitle('number of rides over time (customers VS subscribers)',
fontsize = 17,
fontweight = 'bold')
plt.subplot(1,2,1)
sb.heatmap(temp3_cus, cmap='RdBu');
plt.xticks(np.arange(0.5, 7.5),
('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
plt.title('customer')
plt.subplot(1,2,2)
sb.heatmap(temp3_sub, cmap='RdBu');
plt.xticks(np.arange(0.5, 7.5),
('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
plt.title('subscriber')
I plot heatmaps capturing the number of rides by hour of day and day of week. Looking at the subscriber plot (right), there is a clear trajectory of rides over time: subscribers ride during the morning and evening rush hours on weekdays. The left plot, for customers only, shows the same weekday commuting pattern, but interestingly customers also ride noticeably more on weekends, suggesting that weekend rides by customers may not be pre-scheduled commutes.
temp4_dis = gb.groupby(['age'])['distance_km'].mean().to_frame('mean').reset_index()
temp4_dur = gb.groupby(['age'])['duration_sec'].mean().to_frame('mean').reset_index()
plt.figure(figsize = (15, 6))
plt.subplot(1,2,1)
sb.lineplot(x = 'age', y='mean', data=temp4_dis);
plt.ylabel('distance mean')
plt.title('age VS distance km (mean)')
sb.despine()
plt.subplot(1,2,2)
sb.lineplot(x = 'age', y='mean', data=temp4_dur);
plt.ylabel('duration sec mean')
plt.title('age VS duration sec (mean)')
sb.despine()
I drew two line plots: the left shows age versus average trip distance, and the right shows age versus average trip duration. Between roughly 20 and 60 years old, the averages are relatively stable; above 60, both trip distance and trip duration show large variance.
I made four plots for bivariate exploration. The first insight, drawn from the heatmaps, concerns how the two user types (subscribers vs. customers) behave differently across the week: customers show relatively more demand on weekends, a sign that a marketing campaign could focus on promoting bike sharing for weekend events.
Second, from the line plots, we learn how age affects average trip distance and trip duration.
From the geo analysis in the first part, we can see three clusters on the map: San Francisco, Oakland, and San Jose. It is also interesting to compare the different cities.
plt.figure(figsize = (30, 10))
temp6 = gb.groupby(['city', 'hour', 'DoW']).size().to_frame('size').reset_index()
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
temp6['weekday'] = temp6['DoW'].map(dayOfWeek)
col_order=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sb.set_style("white")
ax = sb.FacetGrid(temp6, col="weekday", col_wrap=7, height=6, col_order=col_order, hue = 'city');
ax.map(sb.lineplot, "hour", "size");
plt.legend()
Adding the cities (San Francisco, San Jose, Oakland) to the rides-over-time view, we can tell all three show similar trends: high volume during weekday commute rush hours and on weekend afternoons. Among the cities, San Francisco has the highest bike-sharing demand, followed by Oakland and then San Jose.
temp7_sub = gb[gb.user_type == 'Subscriber']\
.groupby(['age_group', 'member_gender'])['distance_km']\
.mean().to_frame('mean').reset_index()
temp7_cus = gb[gb.user_type == 'Customer']\
.groupby(['age_group', 'member_gender'])['distance_km']\
.mean().to_frame('mean').reset_index()
temp8_sub = gb[gb.user_type == 'Subscriber']\
.groupby(['age_group', 'member_gender'])['duration_sec']\
.mean().to_frame('mean').reset_index()
temp8_cus = gb[gb.user_type == 'Customer']\
.groupby(['age_group', 'member_gender'])['duration_sec']\
.mean().to_frame('mean').reset_index()
plt.figure(figsize = (20, 20))
plt.subplot(2,2,1)
sb.lineplot(x = temp8_cus['age_group'].astype(str), y='mean', hue = 'member_gender', data = temp8_cus)
plt.xticks(rotation = 90, fontsize = 7)
plt.title('trip duration (sec) for customer')
sb.despine()
plt.subplot(2,2,2)
sb.lineplot(x = temp8_sub['age_group'].astype(str), y='mean', hue = 'member_gender', data = temp8_sub)
plt.xticks(rotation = 90, fontsize = 7)
plt.title('trip duration (sec) for subscriber')
sb.despine()
plt.subplot(2,2,3)
sb.lineplot(x = temp7_cus['age_group'].astype(str), y='mean', hue = 'member_gender', data = temp7_cus)
plt.xticks(rotation = 90, fontsize = 7)
plt.title('trip distance (km) for customer')
sb.despine()
plt.subplot(2,2,4)
sb.lineplot(x = temp7_sub['age_group'].astype(str), y='mean', hue = 'member_gender', data = temp7_sub)
plt.xticks(rotation = 90, fontsize = 7)
plt.title('trip distance (km) for subscriber')
sb.despine()
Across all four plots, the 'Other' gender category behaves unusually; it would be worth investigating with more data. Overall, subscribers generally take longer rides in both distance and duration, and there is no significant difference between male and female riders in average trip duration or average trip distance.
As in the univariate and bivariate explorations, user type has a critical influence, separating customers from subscribers, and for the time series, the day and hour are key drivers of trip counts, locations, and paths.
When we take city into account, the three cities appear to behave somewhat differently; it would be worth analyzing them separately in the future to understand how users in different cities engage with bike sharing.